Abstract

Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained 
by optimizing a margin-based risk function. Traditionally, these risk functions are computed 
based on a labeled dataset. We develop a novel technique for estimating such risks using only 
unlabeled data and the marginal label distribution. We prove that the proposed risk estimator 
is consistent on high-dimensional datasets and demonstrate it on synthetic and real-world data.
In particular, we show how the estimate is used for evaluating classifiers in transfer learning, 
and for training classifiers with no labeled data whatsoever. 

1 Introduction 



Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by
optimizing a margin-based risk function. For standard linear classifiers $\hat Y = \operatorname{sign}\left(\sum_j \theta_j X_j\right)$ with
$Y \in \{-1,+1\}$ and $X, \theta \in \mathbb{R}^d$, the margin is defined as the product

$$Y f_\theta(X) \quad \text{where} \quad f_\theta(X) = \sum_{j=1}^{d} \theta_j X_j. \tag{1}$$



*To whom correspondence should be addressed. Email: IkrishnaLkumarSQgatech . edu| 



Training such classifiers involves choosing a particular value of $\theta$. This is done by minimizing the
risk or expected loss

$$R(\theta) = E_{p(X,Y)}\, L(Y, f_\theta(X)) \tag{2}$$

with the three most popular loss functions

$$L_1(Y, f_\theta(X)) = \exp(-Y f_\theta(X)) \tag{3}$$

$$L_2(Y, f_\theta(X)) = \log\left(1 + \exp(-Y f_\theta(X))\right) \tag{4}$$

$$L_3(Y, f_\theta(X)) = (1 - Y f_\theta(X))_+ \tag{5}$$

being exponential loss $L_1$ (boosting), logloss $L_2$ (logistic regression) and hinge loss $L_3$ (SVM)
respectively ($A_+$ above corresponds to $A$ if $A > 0$ and $0$ otherwise).
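The three losses (3)-(5) are simple functions of the margin $Y f_\theta(X)$. A minimal NumPy sketch follows; the function names and the toy values of `y` and `f` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def exp_loss(y, f):
    """Exponential loss L1 (boosting), equation (3)."""
    return np.exp(-y * f)

def log_loss(y, f):
    """Logloss L2 (logistic regression), equation (4).
    logaddexp(0, -m) = log(1 + exp(-m)) without overflow for large |m|."""
    return np.logaddexp(0.0, -y * f)

def hinge_loss(y, f):
    """Hinge loss L3 (SVM), equation (5): (1 - y f)_+."""
    return np.maximum(0.0, 1.0 - y * f)

# Toy labels and margin values f_theta(X) (hypothetical numbers).
y = np.array([1.0, 1.0, -1.0])
f = np.array([2.0, -0.5, 0.0])
losses = {name: fn(y, f) for name, fn in
          [("exp", exp_loss), ("log", log_loss), ("hinge", hinge_loss)]}
```

All three losses decrease as the margin grows, which is why minimizing any of them pushes $Y f_\theta(X)$ to be large and positive.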

Since the risk $R(\theta)$ depends on the unknown distribution $p$, it is usually replaced during training
with its empirical counterpart

$$R_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(Y^{(i)}, f_\theta(X^{(i)})) \tag{6}$$

based on a labeled training set

$$(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)}) \overset{\text{iid}}{\sim} p \tag{7}$$

leading to the following estimator

$$\hat\theta_n = \arg\min_{\theta} R_n(\theta).$$
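As a point of comparison for what follows, the supervised empirical risk (6) can be sketched in a few lines. The synthetic data, the dimensions, and the helper names below are assumptions made only for illustration.

```python
import numpy as np

def log_loss(y, f):
    """Logloss (4) as a function of labels y and margin values f."""
    return np.logaddexp(0.0, -y * f)

def empirical_risk(theta, X, y, loss):
    """R_n(theta) in (6): average loss over the labeled sample,
    with f_theta(X^(i)) = <theta, X^(i)>."""
    f = X @ theta
    return float(np.mean(loss(y, f)))

# Hypothetical labeled sample generated from a noisy linear rule.
rng = np.random.default_rng(0)
n, d = 500, 20
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.where(X @ theta_true + rng.normal(size=n) > 0, 1.0, -1.0)

risk_at_zero = empirical_risk(np.zeros(d), X, y, log_loss)  # every margin is 0
risk_at_true = empirical_risk(theta_true, X, y, log_loss)
```

Note that `empirical_risk` needs the labels `y`; the estimator developed in this paper is designed to avoid exactly that dependence.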

Note, however, that evaluating and minimizing $R_n$ requires labeled data (7). While suitable in
some cases, there are certainly situations in which labeled data is difficult or impossible to obtain.
In this paper we construct an estimator for $R(\theta)$ using only unlabeled data, that is using

$$X^{(1)}, \ldots, X^{(n)} \overset{\text{iid}}{\sim} p \tag{8}$$

instead of (7). Our estimator is based on the observation that when the data is high dimensional
($d \to \infty$) the quantities

$$f_\theta(X) \mid \{Y = y\}, \quad y \in \{-1,+1\} \tag{9}$$

are often normally distributed. This phenomenon is supported by empirical evidence and may
also be derived using non-iid central limit theorems. We then observe that the limit distributions
of (9) may be estimated from unlabeled data (8) and that these distributions may be used to
measure margin-based losses such as (3)-(5). We examine two novel unsupervised applications:
(i) estimating margin-based losses in transfer learning and (ii) training margin-based classifiers.
We investigate these applications theoretically and also provide empirical results on synthetic and
real-world data. Our empirical evaluation shows the effectiveness of the proposed framework in risk
estimation and classifier training without any labeled data.

The consequences of estimating $R(\theta)$ without labels are indeed profound. Label scarcity is a
well known problem which has led to the emergence of semisupervised learning: learning using a
few labeled examples and many unlabeled ones. The techniques we develop lead to a new paradigm
that goes beyond semisupervised learning in requiring no labels whatsoever.






2 Unsupervised Risk Estimation 



In this section we describe in detail the proposed estimation framework and discuss its theoretical
properties. Specifically, we construct an estimator for $R(\theta)$ (2) using the unlabeled data (8) which
we denote $\hat R_n(\theta; X^{(1)}, \ldots, X^{(n)})$ or simply $\hat R_n(\theta)$ (to distinguish it from $R_n$ in (6)).

Our estimation is based on two assumptions. The first assumption is that the label marginals
$p(Y)$ are known and that $p(Y = 1) \neq p(Y = -1)$. While this assumption may seem restrictive at
first, there are many cases where it holds. Examples include medical diagnosis ($p(Y)$ is the well
known marginal disease frequency), handwriting recognition or OCR ($p(Y)$ is the easily computable
marginal frequencies of different letters in the English language), life expectancy prediction ($p(Y)$
is based on marginal life expectancy tables). In these and other examples $p(Y)$ is known with great
accuracy even if labeled data is unavailable. Furthermore, this assumption may be replaced with a
weaker form in which we know the ordering of the marginal distributions, e.g., $p(Y = 1) > p(Y = -1)$,
but without knowing the specific values of the marginal distributions.

The second assumption is that the quantity $f_\theta(X) \mid Y$ follows a normal distribution. As $f_\theta(X) \mid Y$
is a linear combination of random variables, it is frequently normal when $X$ is high dimensional.
From a theoretical perspective this assumption is motivated by the central limit theorem (CLT).
The classical CLT states that $f_\theta(X) = \sum_{i=1}^{d} \theta_i X_i \mid Y$ is approximately normal for large $d$ if the data
components $X_1, \ldots, X_d$ are iid given $Y$. A more general CLT states that $f_\theta(X) \mid Y$ is asymptotically
normal if $X_1, \ldots, X_d \mid Y$ are independent (but not necessarily identically distributed). Even more
general CLTs state that $f_\theta(X) \mid Y$ is asymptotically normal if $X_1, \ldots, X_d \mid Y$ are not independent but
their dependency is limited in some way. We examine this issue in Section 2.1 and also show that
the normality assumption holds empirically for several standard datasets.

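To derive the estimator we rewrite (2) by taking expectation with respect to $Y$ and $\alpha = f_\theta(X)$

$$R(\theta) = E_{p(f_\theta(X), Y)}\, L(Y, f_\theta(X)) = \sum_{y \in \{-1,+1\}} p(y) \int_{\mathbb{R}} p(f_\theta(X) = \alpha \mid y)\, L(y, \alpha)\, d\alpha. \tag{10}$$

Equation (10) involves three terms $L(y, \alpha)$, $p(y)$ and $p(f_\theta(X) = \alpha \mid y)$. The loss function $L$
is known and poses no difficulty. The second term $p(y)$ is assumed to be known (see discussion
above). The third term is assumed to be normal $f_\theta(X) \mid \{Y = y\} = \sum_i \theta_i X_i \mid \{Y = y\} \sim N(\mu_y, \sigma_y^2)$
with parameters $\mu_y, \sigma_y$, $y \in \{-1, 1\}$ that are estimated by maximizing the likelihood of a Gaussian
mixture model. These estimated parameters are used to construct the plug-in estimator $\hat R_n(\theta)$ as
follows.

$$\ell_n(\mu, \sigma) = \sum_{i=1}^{n} \log p(f_\theta(X^{(i)})) = \sum_{i=1}^{n} \log \sum_{y^{(i)} \in \{-1,+1\}} p(y^{(i)})\, p_{\mu_y, \sigma_y}(f_\theta(X^{(i)}) \mid y^{(i)}) \tag{11}$$

$$(\hat\mu^{(n)}, \hat\sigma^{(n)}) = \arg\max_{\mu, \sigma}\, \ell_n(\mu, \sigma) \tag{12}$$

$$\hat R_n(\theta) = \sum_{y \in \{-1,+1\}} p(y) \int_{\mathbb{R}} p_{\hat\mu_y^{(n)}, \hat\sigma_y^{(n)}}(f_\theta(X) = \alpha \mid y)\, L(y, \alpha)\, d\alpha. \tag{13}$$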
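The construction in (11)-(13) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the likelihood (11) is maximized with a plain EM iteration in which the mixing weights stay fixed at the known $p(y)$, the integral in (13) is approximated with a trapezoid rule, and all function names, the initialization, and the synthetic margins are hypothetical choices.

```python
import numpy as np

def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def fit_mixture_known_weights(z, p_pos, n_iter=200):
    """EM for the means/stds of a two-component 1-D Gaussian mixture whose
    weights p(Y=1)=p_pos and p(Y=-1)=1-p_pos are known and never updated.
    Index 0 corresponds to y=+1, index 1 to y=-1."""
    w = np.array([p_pos, 1.0 - p_pos])
    mu = np.quantile(z, [0.75, 0.25])   # crude initialization (assumption of this sketch)
    sigma = np.full(2, z.std())
    for _ in range(n_iter):
        dens = w * normal_pdf(z[:, None], mu, sigma)     # shape (n, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)    # E-step: posterior over labels
        nk = resp.sum(axis=0)
        mu = (resp * z[:, None]).sum(axis=0) / nk        # M-step: means
        sigma = np.sqrt((resp * (z[:, None] - mu) ** 2).sum(axis=0) / nk)  # M-step: stds
    return mu, sigma

def plugin_risk(mu, sigma, p_pos, loss, grid_size=4001):
    """Equation (13): sum_y p(y) * integral of N(a; mu_y, sigma_y^2) L(y, a) da."""
    risk = 0.0
    for y, p_y, m, s in ((1.0, p_pos, mu[0], sigma[0]),
                         (-1.0, 1.0 - p_pos, mu[1], sigma[1])):
        a = np.linspace(m - 8.0 * s, m + 8.0 * s, grid_size)
        vals = normal_pdf(a, m, s) * loss(y, a)
        risk += p_y * np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(a))  # trapezoid rule
    return float(risk)

def log_loss(y, a):
    return np.logaddexp(0.0, -y * a)

# Synthetic margins z = f_theta(X) drawn from the assumed model.
# The labels y are used only to simulate data and to compute the supervised reference.
rng = np.random.default_rng(0)
n, p_pos = 20000, 0.7
y = np.where(rng.random(n) < p_pos, 1.0, -1.0)
z = np.where(y > 0, rng.normal(1.5, 1.0, n), rng.normal(-1.0, 0.7, n))

mu_hat, sigma_hat = fit_mixture_known_weights(z, p_pos)        # uses z and p(Y) only
unsupervised = plugin_risk(mu_hat, sigma_hat, p_pos, log_loss)  # estimate (13)
supervised = float(np.mean(log_loss(y, z)))                     # empirical risk (6)
```

EM can stall in local optima, so a practical version would likely need several restarts; the sketch keeps a single initialization for brevity.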



We make the following observations. 



1. Although we do not denote it explicitly, $\mu_y$ and $\sigma_y$ are functions of $\theta$.

2. The loglikelihood (11) does not use labeled data (it marginalizes over the label $y^{(i)}$).

3. The parameters of the loglikelihood (11) are $\mu = (\mu_1, \mu_{-1})$ and $\sigma = (\sigma_1, \sigma_{-1})$ rather than the
parameter $\theta$ associated with the margin-based classifier. We consider the latter as a fixed
constant at this point.

4. The estimation problem (12) is equivalent to the problem of maximum likelihood for means
and variances of a Gaussian mixture model where the label marginals are assumed to be
known. It is well known that in this case (barring the symmetric case of a uniform $p(y)$) the
MLE converges to the true parameter values.

5. The estimator $\hat R_n$ (13) is consistent in the limit of infinite unlabeled data

$$P\left(\lim_{n \to \infty} \hat R_n(\theta) = R(\theta)\right) = 1.$$

6. The two risk estimators $\hat R_n(\theta)$ (13) and $R_n(\theta)$ (6) approximate the expected loss $R(\theta)$. The
latter uses labeled samples and is typically more accurate than the former for a fixed $n$.

7. Under suitable conditions $\arg\min_\theta \hat R_n(\theta)$ converges to the expected risk minimizer

$$P\left(\lim_{n \to \infty} \arg\min_{\theta \in \Theta} \hat R_n(\theta) = \arg\min_{\theta \in \Theta} R(\theta)\right) = 1.$$

This far reaching conclusion implies that in cases where $\arg\min_\theta R(\theta)$ is the Bayes classifier
(as is the case with exponential loss, log loss, and hinge loss) we can retrieve the optimal
classifier without a single labeled data point.

2.1 Asymptotic Normality of $f_\theta(X) \mid Y$

The quantity $f_\theta(X) \mid Y$ is essentially a sum of $d$ random variables which for large $d$ is likely to
be normally distributed. One way to verify this is empirically, as we show in Figures 1-2 which
contrast the histogram with a fitted normal pdf for text, digit images, and face images data. For
these datasets the dimensionality $d$ is sufficiently high to provide a nearly normal $f_\theta(X) \mid Y$. For
example, in the case of text documents {Xi is the relative number of times word i appeared in 
the document) d corresponds to the vocabulary size which is typically a large number in the range 
10^ — 10^. Similarly, in the case of image classification [Xi denotes the brightness of the i-pixel) 
the dimensionality is on the order of lO'^ — 10^. 

Figures 1-2 show that in these cases of text and image data $f_\theta(X) \mid Y$ is approximately normal
for both randomly drawn $\theta$ vectors (Figure 1) and for $\theta$ representing estimated classifiers (Figure 2).
The single caveat in this case is that normality may not hold when $\theta$ is sparse, as may happen for
example for $l_1$ regularized models (last row of Figure 2).

From a theoretical standpoint normality may be argued using a central limit theorem. We
examine below several progressively more general central limit theorems and discuss whether these
theorems are likely to hold in practice for high dimensional data. The original central limit theorem
states that $\sum_{i=1}^{d} Z_i$ is approximately normal for large $d$ if $Z_i$ are iid.

Proposition 1 (de-Moivre). If $Z_i, i \in \mathbb{N}$ are iid with expectation $\mu$ and variance $\sigma^2$ and
$\bar Z_d = d^{-1} \sum_{i=1}^{d} Z_i$ then we have the following convergence in distribution

$$\sqrt{d}\,(\bar Z_d - \mu)/\sigma \rightsquigarrow N(0, 1)$$

as $d \to \infty$.
[Figure 1 panels: histograms for RCV1 text data, MNIST handwritten digit images, and face images; horizontal axes range from -5 to 5.]

Figure 1: Centered histograms of $f_\theta(X) \mid \{Y = 1\}$ overlaid with the pdf of a fitted Gaussian for randomly
drawn $\theta$ vectors ($\theta_i \sim U(-1/2, 1/2)$). The columns represent datasets (RCV1 text data [6], MNIST digit
images, and face images) and the rows represent multiple random draws. For uniformity we subtracted
the empirical mean and divided by the empirical standard deviation. The twelve panels show that even in
moderate dimensionality (RCV1: 1000 top words, MNIST digits: 784 pixels, face images: 400 pixels) the
assumption that $f_\theta(X) \mid Y$ is normal often holds for randomly drawn $\theta$.
[Figure 2 panels: histograms for RCV1 text data, MNIST handwritten digit images, and face images.]

Figure 2: Centered histograms of $f_\theta(X) \mid \{Y = 1\}$ overlaid with the pdf of a fitted Gaussian for multiple
$\theta$ vectors (four rows: Fisher's LDA, logistic regression, $l_2$ regularized logistic regression, and $l_1$ regularized
logistic regression; all regularization parameters were selected by cross validation) and datasets (columns:
RCV1 text data [6], MNIST digit images, and face images). For uniformity we subtracted the empirical
mean and divided by the empirical standard deviation. The twelve panels show that even in moderate
dimensionality (RCV1: 1000 top words, MNIST digits: 784 pixels, face images: 400 pixels) the assumption
that $f_\theta(X) \mid Y$ is normal holds well for fitted $\theta$ values (except perhaps for $l_1$ regularization in the last row
which promotes sparse $\theta$).
As a result, the quantity $\sum_{i=1}^{d} Z_i$ (which is a linear transformation of $\sqrt{d}\,(\bar Z_d - \mu)/\sigma$) is approximately
normal for large $d$. This relatively restricted theorem is unlikely to hold in most practical
cases as the data dimensions are often not iid.

A more general CLT does not require the summands $Z_i$ to be identically distributed.

Proposition 2 (Lindeberg). For $Z_i, i \in \mathbb{N}$ independent with expectation $\mu_i$ and variance $\sigma_i^2$, and
denoting $s_d^2 = \sum_{i=1}^{d} \sigma_i^2$, we have the following convergence in distribution as $d \to \infty$

$$s_d^{-1} \sum_{i=1}^{d} (Z_i - \mu_i) \rightsquigarrow N(0, 1)$$

if the following condition holds for every $\epsilon > 0$

$$\lim_{d \to \infty} s_d^{-2} \sum_{i=1}^{d} E\left[(Z_i - \mu_i)^2\, 1_{\{|Z_i - \mu_i| > \epsilon s_d\}}\right] = 0. \tag{14}$$

This CLT is more general as it only requires that the data dimensions be independent. The
condition (14) is relatively mild and specifies that the contribution of each of the $Z_i$ to the variance
$s_d^2$ should not dominate it. Nevertheless, the Lindeberg CLT is still inapplicable for dependent data
dimensions.

More general CLTs replace the condition that $Z_i, i \in \mathbb{N}$ be independent with the notion of
$m(k)$-dependence.

Definition 1. The random variables $Z_i, i \in \mathbb{N}$ are said to be $m(k)$-dependent if whenever $s - r > m(k)$
the two sets $\{Z_1, \ldots, Z_r\}$, $\{Z_s, \ldots, Z_k\}$ are independent.

An early CLT for $m(k)$-dependent RVs is [5]. Below is a slightly weakened version of the CLT
in [2].

Proposition 3 (Berk). For each $k \in \mathbb{N}$ let $d(k)$ and $m(k)$ be increasing sequences and suppose that
$Z_1^{(k)}, \ldots, Z_{d(k)}^{(k)}$ is an $m(k)$-dependent sequence of random variables. If

1. $E|Z_i^{(k)}|^2 \le M$ for all $i$ and $k$

2. $\operatorname{Var}(Z_{i+1}^{(k)} + \cdots + Z_j^{(k)}) \le (j - i) K$ for all $i, j, k$

3. $\lim_{k \to \infty} \operatorname{Var}(Z_1^{(k)} + \cdots + Z_{d(k)}^{(k)}) / d(k)$ exists and is non-zero

4. limk^^ m"^ (k) / d{k) = 

then $\sum_{i=1}^{d(k)} Z_i^{(k)} / \sqrt{d(k)}$ is asymptotically normal as $k \to \infty$.

Proposition 3 states that under mild conditions the sum of $m(k)$-dependent RVs is asymptotically
normal. If $m(k)$ is a constant, i.e., $m(k) = m$, $m(k)$-dependence implies that a $Z_i$ may only
depend on its neighboring dimensions. In other words, dimensions that are far removed from each
other are independent. The full power of Proposition 3 is invoked when $m(k)$ grows with $k$, relaxing
the independence restriction as the dimensionality grows. Intuitively, the dependency of the
summands is not fixed to a certain order, but it cannot grow too rapidly.
A more realistic variation of $m(k)$-dependence, where the dependency of each variable is specified
using a dependency graph (rather than each dimension depending on neighboring dimensions), is
advocated in a number of papers, including the following recent result by [10].

Definition 2. A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ indexing random variables is called a dependency graph if for
any pair of disjoint subsets of $\mathcal{V}$, $A_1$ and $A_2$, such that no edge in $\mathcal{E}$ has one endpoint in $A_1$ and the
other in $A_2$, we have independence between $\{Z_i : i \in A_1\}$ and $\{Z_i : i \in A_2\}$. The degree $d(v)$ of a
vertex is the number of edges connected to it and the maximal degree is $\max_{v \in \mathcal{V}} d(v)$.

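Proposition 4 (Rinott). Let $Z_1, \ldots, Z_n$ be random variables having a dependency graph whose
maximal degree is strictly less than $D$, satisfying $|Z_i - E Z_i| \le B$ a.s., $\forall i$, $E(\sum_{i=1}^{n} Z_i) = \lambda$ and
$\operatorname{Var}(\sum_{i=1}^{n} Z_i) = \sigma^2 > 0$. Then for any $w \in \mathbb{R}$,

$$\left| P\left(\frac{\sum_{i=1}^{n} Z_i - \lambda}{\sigma} \le w\right) - \Phi(w) \right| \le \frac{1}{\sigma}\left(\frac{1}{\sqrt{2\pi}} D B + 16 \left(\frac{n}{\sigma^2}\right)^{1/2} D^{3/2} B^2 + 10 \left(\frac{n}{\sigma^2}\right) D^2 B^3\right)$$

where $\Phi$ is the CDF of the $N(0, 1)$ distribution.

The above theorem states a stronger result than convergence in distribution to a Gaussian in
that it states a uniform rate of convergence of the CDF. Such results are known in the literature
as Berry-Esseen bounds. When $D$ and $B$ are bounded and $\operatorname{Var}(\sum_{i=1}^{n} Z_i) = O(n)$ it yields a CLT
with an optimal convergence rate of $n^{-1/2}$.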

The question of whether the above CLTs apply in practice is a delicate one. For text one can
argue that the appearance of a word depends on some words but is independent of other words.
Similarly for images it is plausible to say that the brightness of a pixel is independent of pixels
that are spatially far removed from it. In practice one needs to verify the normality assumption
empirically, which is simple to do by comparing the empirical histogram of $f_\theta(X)$ with that of a
fitted mixture of Gaussians. As the figures above indicate, this holds for text and image data for
most values of $\theta$, assuming it is not sparse.
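A quick numerical check in the spirit of this discussion can be sketched as follows. It is not the histogram comparison used in Figures 1-2: as a cruder stand-in, it reports sample skewness and excess kurtosis (both close to 0 for normal data) of $f_\theta(X)$ for a single class, using synthetic data with independent, strongly skewed coordinates. The data distribution, the dimensions, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def standardized_moments(z):
    """Sample skewness and excess kurtosis of z; both are near 0 for normal data."""
    u = (z - z.mean()) / z.std()
    return float(np.mean(u ** 3)), float(np.mean(u ** 4) - 3.0)

rng = np.random.default_rng(0)
n = 5000
moments = {}
for d in (1, 10, 1000):
    X = rng.exponential(size=(n, d))          # independent, skewed coordinates
    theta = rng.uniform(-0.5, 0.5, size=d)    # random theta, as in Figure 1
    moments[d] = standardized_moments(X @ theta)
# moments[d] holds (skewness, excess kurtosis) of f_theta(X) for each dimension d
```

For $d = 1$ the projection inherits the skewness of a single exponential coordinate, while for large $d$ both moments should shrink toward 0, consistent with the Lindeberg CLT for independent, non-identically scaled summands.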

2.2 Unsupervised Consistency of $\hat R_n(\theta)$

We start with proving identifiability of the maximum likelihood estimator (MLE) for a mixture of
two Gaussians with known ordering of mixture proportions. Invoking classical consistency results in
conjunction with identifiability we show consistency of the MLE for $(\mu, \sigma)$ parameterizing
the distribution of $f_\theta(X) \mid Y$. As a result, consistency of the estimator $\hat R_n(\theta)$ follows.

Definition 3. A parametric family $\{p_\alpha : \alpha \in A\}$ is identifiable when $p_\alpha(x) = p_{\alpha'}(x), \forall x$ implies
$\alpha = \alpha'$.

Proposition 5. Assuming known label marginals with $p(Y = 1) \neq p(Y = -1)$ the Gaussian
mixture family

$$p_{\mu, \sigma}(x) = p(y = 1)\, N(x; \mu_1, \sigma_1^2) + p(y = -1)\, N(x; \mu_{-1}, \sigma_{-1}^2) \tag{15}$$

is identifiable.

Proof. It can be shown that the family of Gaussian mixture models with no a priori information about
label marginals is identifiable up to a permutation of the labels $y$ [11]. We proceed by assuming
with no loss of generality that $p(y = 1) > p(y = -1)$. The alternative case $p(y = 1) < p(y = -1)$
may be handled in the same manner. Using the result of [11] we have that if $p_{\mu, \sigma}(x) = p_{\mu', \sigma'}(x)$
for all $x$, then $(p(y), \mu, \sigma) = (p(y), \mu', \sigma')$ up to a permutation of the labels. Since permuting
the labels violates our assumption $p(y = 1) > p(y = -1)$ we establish $(\mu, \sigma) = (\mu', \sigma')$, proving
identifiability. □

The assumption that $p(y)$ is known is not entirely crucial. It may be relaxed by assuming that
it is known whether $p(Y = 1) > p(Y = -1)$ or $p(Y = 1) < p(Y = -1)$. Proving Proposition 5
under this much weaker assumption follows identical lines.
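To see why the condition $p(Y = 1) \neq p(Y = -1)$ matters, it may help to look at the symmetric case directly. The short calculation below is an illustration added here, not part of the original proof; the symbols $a, b, s, t$ are placeholders for arbitrary means and standard deviations.

```latex
% Symmetric case p(Y=1) = p(Y=-1) = 1/2: the parameter vectors
% (\mu_1, \mu_{-1}, \sigma_1, \sigma_{-1}) = (a, b, s, t) and (b, a, t, s)
% define the same mixture density, so the family is not identifiable.
\begin{align*}
p_{(a,b),(s,t)}(x) &= \tfrac{1}{2} N(x; a, s^2) + \tfrac{1}{2} N(x; b, t^2) \\
                   &= \tfrac{1}{2} N(x; b, t^2) + \tfrac{1}{2} N(x; a, s^2)
                    = p_{(b,a),(t,s)}(x).
\end{align*}
% With unequal weights, e.g. p(Y=1) = 0.7, the swapped parameters give
% 0.7 N(x; b, t^2) + 0.3 N(x; a, s^2), which differs from
% 0.7 N(x; a, s^2) + 0.3 N(x; b, t^2) unless (a, s) = (b, t).
```

The unequal weights act as a tag that tells the two components apart, which is exactly what the proof uses to rule out the label permutation.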

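Proposition 6. Under the assumptions of Proposition 5 the MLE estimates for $(\mu, \sigma) = (\mu_1, \mu_{-1}, \sigma_1, \sigma_{-1})$

$$(\hat\mu^{(n)}, \hat\sigma^{(n)}) = \arg\max_{\mu, \sigma}\, \ell_n(\mu, \sigma) \tag{16}$$

$$\ell_n(\mu, \sigma) = \sum_{i=1}^{n} \log \sum_{y^{(i)}} p(y^{(i)})\, p_{\mu_y, \sigma_y}(f_\theta(X^{(i)}) \mid y^{(i)}) \tag{17}$$

are consistent, i.e., $(\hat\mu_1^{(n)}, \hat\mu_{-1}^{(n)}, \hat\sigma_1^{(n)}, \hat\sigma_{-1}^{(n)})$ converge as $n \to \infty$ to the true parameter values with
probability 1.

Proof. Denoting $p_\eta(z) = \sum_y p(y)\, p_{\mu_y, \sigma_y}(z \mid y)$ with $\eta = (\mu, \sigma)$ we note that $p_\eta$ is identifiable (see
Proposition 5) in $\eta$ and the available samples $z^{(i)} = f_\theta(X^{(i)})$ are iid samples from $p_\eta(z)$. We
therefore use standard statistics theory which indicates that the MLE for an identifiable parametric
model is strongly consistent [4]. □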



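Proposition 7. Under the assumptions of Proposition 6 and assuming the loss $L$ is given by one
of (3)-(5) with a normal $f_\theta(X) \mid Y \sim N(\mu_y, \sigma_y^2)$, the plug-in risk estimate

$$\hat R_n(\theta) = \sum_{y \in \{-1,+1\}} p(y) \int_{\mathbb{R}} p_{\hat\mu_y^{(n)}, \hat\sigma_y^{(n)}}(f_\theta(X) = \alpha \mid y)\, L(y, \alpha)\, d\alpha \tag{18}$$

is consistent, i.e., for all $\theta$,

$$P\left(\lim_{n \to \infty} \hat R_n(\theta) = R(\theta)\right) = 1.$$

Proof. The plug-in risk estimate $\hat R_n$ in (18) is a continuous function (when $L$ is given by (3), (4)
or (5)) of $\hat\mu_1^{(n)}, \hat\mu_{-1}^{(n)}, \hat\sigma_1^{(n)}, \hat\sigma_{-1}^{(n)}$ (note that $\mu_y$ and $\sigma_y$ are functions of $\theta$), which we denote
$\hat R_n(\theta) = h(\hat\mu_1^{(n)}, \hat\mu_{-1}^{(n)}, \hat\sigma_1^{(n)}, \hat\sigma_{-1}^{(n)})$.

Using Proposition 6 we have that

$$\lim_{n \to \infty} (\hat\mu_1^{(n)}, \hat\mu_{-1}^{(n)}, \hat\sigma_1^{(n)}, \hat\sigma_{-1}^{(n)}) = (\mu_1^{\text{true}}, \mu_{-1}^{\text{true}}, \sigma_1^{\text{true}}, \sigma_{-1}^{\text{true}})$$

with probability 1. Since continuous functions preserve limits we have

$$\lim_{n \to \infty} h(\hat\mu_1^{(n)}, \hat\mu_{-1}^{(n)}, \hat\sigma_1^{(n)}, \hat\sigma_{-1}^{(n)}) = h(\mu_1^{\text{true}}, \mu_{-1}^{\text{true}}, \sigma_1^{\text{true}}, \sigma_{-1}^{\text{true}})$$

with probability 1, which implies convergence $\lim_{n \to \infty} \hat R_n(\theta) = R(\theta)$ with probability 1. □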
2.3 Unsupervised Consistency of $\arg\min \hat R_n(\theta)$

The convergence above $\hat R_n(\theta) \to R(\theta)$ is pointwise in $\theta$. If the stronger concept of uniform
convergence is assumed over $\theta \in \Theta$ we obtain consistency of $\arg\min_\theta \hat R_n(\theta)$. This surprising result
indicates that in some cases it is possible to retrieve the expected risk minimizer (and therefore the
Bayes classifier in the case of the hinge loss, log-loss and exp-loss) using only unlabeled data. We
show this uniform convergence using a modification of Wald's classical MLE consistency result [4].
Denoting

$$p_\eta(z) = \sum_{y \in \{-1,+1\}} p(y)\, p_{\mu_y, \sigma_y}(f_\theta(X) = z \mid y), \qquad \eta = (\mu_1, \mu_{-1}, \sigma_1, \sigma_{-1})$$

we first show that the MLE converges to the true parameter value $\hat\eta_n \to \eta_0$ uniformly. Uniform
convergence of the risk estimator $\hat R_n(\theta)$ follows. Since changing $\theta \in \Theta$ results in a different $\eta \in E$
we can state the uniform convergence in $\theta \in \Theta$ or alternatively in $\eta \in E$.

Proposition 8. Let 6 take values in for which rj £ E for some compact set E. Then assuming 
the conditions in Proposition the convergence of the MLE to the true value fjn — ?• rjQ is uniform 
in rjQ £ E (or alternatively 9 £ @). 

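Proof. We start by introducing the following notation

$$U(z, \eta, \eta_0) = \log p_\eta(z) - \log p_{\eta_0}(z)$$

$$\alpha(\eta, \eta_0) = E_{p_{\eta_0}} U(z, \eta, \eta_0) = -D(p_{\eta_0}, p_\eta) \le 0$$

with the latter quantity being non-positive and $0$ iff $\eta = \eta_0$ (due to Shannon's inequality and
identifiability of $p_\eta$).

For $\rho > 0$ we define the compact set $S_{\eta_0, \rho} = \{\eta \in E : \|\eta - \eta_0\| \ge \rho\}$. Since $\alpha(\eta, \eta_0)$ is continuous
it achieves its maximum (with respect to $\eta$) on $S_{\eta_0, \rho}$, denoted by $\delta_\rho(\eta_0) = \max_{\eta \in S_{\eta_0, \rho}} \alpha(\eta, \eta_0) < 0$,
which is negative since $\alpha(\eta, \eta_0) = 0$ iff $\eta = \eta_0$. Furthermore, note that $\delta_\rho(\eta_0)$ is itself continuous in
$\eta_0 \in E$ and since $E$ is compact it achieves its maximum

$$\delta = \max_{\eta_0 \in E} \delta_\rho(\eta_0) = \max_{\eta_0 \in E} \max_{\eta \in S_{\eta_0, \rho}} \alpha(\eta, \eta_0) < 0$$

which is negative for the same reason.

Invoking the uniform strong law of large numbers [3] we have $n^{-1} \sum_{i=1}^{n} U(z^{(i)}, \eta, \eta_0) \to \alpha(\eta, \eta_0)$
uniformly over $(\eta, \eta_0) \in E^2$. Consequently, there exists $N$ such that for $n > N$ (with probability
1)

$$\sup_{\eta_0 \in E}\, \sup_{\eta \in S_{\eta_0, \rho}} \frac{1}{n} \sum_{i=1}^{n} U(z^{(i)}, \eta, \eta_0) < \delta/2 < 0.$$

But since $n^{-1} \sum_{i=1}^{n} U(z^{(i)}, \eta, \eta_0) \to 0$ for $\eta = \eta_0$ it follows that the MLE

$$\hat\eta_n = \arg\max_{\eta \in E} \frac{1}{n} \sum_{i=1}^{n} U(z^{(i)}, \eta, \eta_0)$$

is outside $S_{\eta_0, \rho}$ (for $n > N$ uniformly in $\eta_0 \in E$) which implies $\|\hat\eta_n - \eta_0\| < \rho$. Since $\rho > 0$ is
arbitrary and $N$ does not depend on $\eta_0$ we have $\hat\eta_n \to \eta_0$ uniformly over $\eta_0 \in E$. □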
Proposition 9. Assuming that $X, \Theta$ are bounded in addition to the assumptions of Proposition 8,
the convergence $\hat R_n(\theta) \to R(\theta)$ is uniform in $\theta \in \Theta$.

Proof. Since $X, \Theta$ are bounded the margin value $f_\theta(X)$ is bounded with probability 1. As a result
the loss function is bounded in absolute value by a constant $C$. We also note that a mixture of two
Gaussians (with known mixing proportions) is Lipschitz continuous in its parameters, which may be
verified by noting that the partial derivatives of $p_\eta(z) = \sum_y p(y)\, p_{\mu_y, \sigma_y}(z \mid y)$ with respect to the means

$$\frac{\partial p_\eta(z)}{\partial \mu_y} = \frac{p(y)\,(z - \mu_y)}{(2\pi)^{1/2} \sigma_y^3} \exp\left(-\frac{(z - \mu_y)^2}{2 \sigma_y^2}\right), \qquad y \in \{-1,+1\},$$

and the corresponding derivatives with respect to $\sigma_1, \sigma_{-1}$ are bounded for a compact $E$. These
observations, together with Proposition 8, lead to

$$|\hat R_n(\theta) - R(\theta)| \le \sum_{y \in \{-1,+1\}} p(y) \int \left| p_{\hat\mu_y^{(n)}, \hat\sigma_y^{(n)}}(f_\theta(X) = \alpha) - p_{\mu_y^{\text{true}}, \sigma_y^{\text{true}}}(f_\theta(X) = \alpha) \right| |L(y, \alpha)|\, d\alpha \to 0$$

uniformly over $\theta \in \Theta$. □

Proposition 10. Under the assumptions of Proposition 9

$$P\left(\lim_{n \to \infty} \arg\min_{\theta \in \Theta} \hat R_n(\theta) = \arg\min_{\theta \in \Theta} R(\theta)\right) = 1.$$



Proof. We denote $t^* = \arg\min R(\theta)$, $t_n = \arg\min \hat R_n(\theta)$. Since $\hat R_n(\theta) \to R(\theta)$ uniformly, for each
$\epsilon > 0$ there exists $N$ such that for all $n > N$, $|\hat R_n(\theta) - R(\theta)| < \epsilon$.

Let $S = \{\theta : \|\theta - t^*\| \ge \epsilon\}$ and $\min_{\theta \in S} R(\theta) > R(t^*)$ ($S$ is compact and thus $R$ achieves its
minimum on it). There exists $N'$ such that for all $n > N'$ and $\theta \in S$, $\hat R_n(\theta) > R(t^*) + \epsilon$. On the
other hand, $\hat R_n(t^*) \to R(t^*)$ which together with the previous statement implies that there exists
$N''$ such that for $n > N''$, $\hat R_n(t^*) < \hat R_n(\theta)$ for all $\theta \in S$. We thus conclude that for $n > N''$, $t_n \notin S$.
Since we showed that for each $\epsilon > 0$ there exists $N''$ such that for all $n > N''$ we have $\|t_n - t^*\| < \epsilon$,
$t_n \to t^*$ which concludes the proof. □



2.4 Asymptotic Variance 



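In addition to consistency, it is useful to characterize the accuracy of our estimator $\hat R_n(\theta)$ as a
function of $p(y), \mu, \sigma$. We do so by computing the asymptotic variance of the estimator which
equals the inverse Fisher information

$$\sqrt{n}\,(\hat\eta_n - \eta^{\text{true}}) \rightsquigarrow N\left(0, I^{-1}(\eta^{\text{true}})\right)$$

and analyzing its dependency on the model parameters. We first derive the asymptotic variance of
the MLE for a mixture of Gaussians (we denote below $\eta = (\eta_1, \eta_{-1})$, $\eta_i = (\mu_i, \sigma_i^2)$)

$$p_\eta(z) = p(Y = 1)\, N(z; \mu_1, \sigma_1^2) + p(Y = -1)\, N(z; \mu_{-1}, \sigma_{-1}^2) \tag{19}$$

$$\phantom{p_\eta(z)} = p_1\, p_{\eta_1}(z) + p_{-1}\, p_{\eta_{-1}}(z). \tag{20}$$

The elements of the $4 \times 4$ information matrix $I(\eta)$

$$I(\eta_i, \eta_j) = E\left(\frac{\partial \log p_\eta(z)}{\partial \eta_i}\, \frac{\partial \log p_\eta(z)}{\partial \eta_j}\right)$$

may be computed using the following derivatives

$$\frac{\partial \log p_\eta(z)}{\partial \mu_i} = \frac{p_i}{\sigma_i} \left(\frac{z - \mu_i}{\sigma_i}\right) \frac{p_{\eta_i}(z)}{p_\eta(z)}$$

$$\frac{\partial \log p_\eta(z)}{\partial \sigma_i^2} = \frac{p_i}{2 \sigma_i^2} \left(\left(\frac{z - \mu_i}{\sigma_i}\right)^2 - 1\right) \frac{p_{\eta_i}(z)}{p_\eta(z)}$$

for $i = 1, -1$. Using derivations similar to the ones in [1] we obtain

$$I(\mu_i, \mu_j) = \frac{p_i p_j}{\sigma_i \sigma_j}\, M_{11}\left(p_{\eta_i}(z), p_{\eta_j}(z)\right)$$

$$I(\mu_1, \sigma_i^2) = \frac{p_1 p_i}{2 \sigma_1 \sigma_i^2} \left[ M_{12}\left(p_{\eta_1}(z), p_{\eta_i}(z)\right) - M_{10}\left(p_{\eta_1}(z), p_{\eta_i}(z)\right) \right]$$

$$I(\mu_{-1}, \sigma_i^2) = \frac{p_{-1} p_i}{2 \sigma_{-1} \sigma_i^2} \left[ M_{21}\left(p_{\eta_i}(z), p_{\eta_{-1}}(z)\right) - M_{01}\left(p_{\eta_i}(z), p_{\eta_{-1}}(z)\right) \right]$$

$$I(\sigma_i^2, \sigma_i^2) = \frac{p_i^2}{4 \sigma_i^4} \left[ M_{00}\left(p_{\eta_i}(z), p_{\eta_i}(z)\right) - 2 M_{11}\left(p_{\eta_i}(z), p_{\eta_i}(z)\right) + M_{22}\left(p_{\eta_i}(z), p_{\eta_i}(z)\right) \right]$$

$$I(\sigma_1^2, \sigma_{-1}^2) = \frac{p_1 p_{-1}}{4 \sigma_1^2 \sigma_{-1}^2} \left[ M_{00}\left(p_{\eta_1}(z), p_{\eta_{-1}}(z)\right) - M_{20}\left(p_{\eta_1}(z), p_{\eta_{-1}}(z)\right) - M_{02}\left(p_{\eta_1}(z), p_{\eta_{-1}}(z)\right) + M_{22}\left(p_{\eta_1}(z), p_{\eta_{-1}}(z)\right) \right]$$

where

$$M_{m,n}\left(p_{\eta_i}(z), p_{\eta_j}(z)\right) = \int_{-\infty}^{\infty} \left(\frac{z - \mu_i}{\sigma_i}\right)^m \left(\frac{z - \mu_j}{\sigma_j}\right)^n \frac{p_{\eta_i}(z)\, p_{\eta_j}(z)}{p_\eta(z)}\, dz.$$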



In some cases it is more instructive to consider the asymptotic variance of the risk estimator
$\hat R_n(\theta)$ rather than that of the parameter estimate for $\eta = (\mu, \sigma)$. This could be computed using
the delta method and the above Fisher information matrix

$$\sqrt{n}\,(\hat R_n(\theta) - R(\theta)) \rightsquigarrow N\left(0,\ \nabla h(\eta^{\text{true}})^\top I^{-1}(\eta^{\text{true}})\, \nabla h(\eta^{\text{true}})\right)$$

where $\nabla h$ is the gradient vector of the mapping $R(\theta) = h(\eta)$. For example, in the case of the
exponential loss (3) we get

h{^)=p{Y = l)a,V2e.p (^^^^ " ^) + p{Y = -l)a-i^exp (^^^^^^ - ^ 



dhiv) _ V2P{Y = l){f,,{al - 1) - al) _ /(^i - 1)^ 



exp 



dfii ai V 2 2ai 

dh{r,) _ V2P{Y = -l){^^^{al^ - 1) + al^) Hj^^^ + V)^ _ 

dhjrj) _ P{Y = l){fij+al) f {f,,-l)^ /ij 
daf V2oT ^ 2 2c7f 

dhjrj) ^ P(y = -l)(/z^+a^) / (;u-i + l)^ _ ^ 
^/2air V 2 2(t2^ 

Figure 3 plots the asymptotic accuracy of $\hat R_n(\theta)$ for log-loss. The left panel shows that the
accuracy of $\hat R_n$ increases with the imbalance of the marginal distribution $p(Y)$. The right panel
shows that the accuracy of $\hat R_n$ increases with the difference between the means $|\mu_1 - \mu_{-1}|$ and the
variances $\sigma_1 / \sigma_2$.

2.5 Multiclass Classification 

Thus far, we have considered unsupervised risk estimation in binary classification. In this section
we describe a multiclass extension based on standard extensions of the margin concept to multiclass
classification. In this case the margin vector associated with the multiclass classifier

$$\hat Y = \arg\max_{k = 1, \ldots, K} f_{\theta^k}(X), \qquad X, \theta^k \in \mathbb{R}^d \tag{21}$$

is $f_\theta(X) = (f_{\theta^1}(X), \ldots, f_{\theta^K}(X))$. Following our discussion of the binary case, $f_{\theta^k}(X) \mid Y$, $k = 1, \ldots, K$,
is assumed to be normally distributed with parameters that are estimated by maximizing
the likelihood of a Gaussian mixture model. We thus have $K$ Gaussian mixture models, each one
with $K$ mixture components. The estimated parameters are plugged in as before into the multiclass
risk

$$R(\theta) = E_{p(f_\theta(X), Y)}\, L(Y, f_\theta(X)) \tag{22}$$
where $L$ is a multiclass margin based loss function such as

$$L(Y, f_\theta(X)) = \log\left(1 + \exp(-f_{\theta^Y}(X))\right) \tag{23}$$

$$L(Y, f_\theta(X)) = \sum_{k \neq Y} \left(1 + f_{\theta^k}(X)\right)_+. \tag{24}$$

Since the MLE for a Gaussian mixture model with $K$ components is consistent (assuming $P(Y)$ is
known and all probabilities $P(Y = k)$, $k = 1, \ldots, K$ are distinct) the MLE estimators for $f_{\theta^k}(X) \mid Y = k'$
are consistent. Furthermore, if the loss $L$ is a continuous function of these parameters (as is the
case for (23)-(24)) the risk estimator $\hat R_n(\theta)$ is consistent as well.

Figure 3: Left panel: asymptotic accuracy (inverse of trace of asymptotic variance) of $\hat R_n(\theta)$ for
logloss as a function of the imbalance of the class marginal $p(Y)$. The accuracy increases with the
class imbalance as it is easier to separate the two mixture components. Right panel: asymptotic
accuracy (inverse of trace of asymptotic variance) as a function of the difference between the means
$|\mu_1 - \mu_{-1}|$ and the variances $\sigma_1 / \sigma_2$. See text for more information.

3 Application 1: Estimating Risk in Transfer Learning 

We consider applying our estimation framework in two ways. The first application, which we 
describe in this section, is estimating margin-based risks in transfer learning where classifiers are 
trained on one domain but tested on a somewhat different domain. The transfer learning assumption 
that labeled data exists for the training domain but not for the test domain motivates the use of 
our unsupervised risk estimation. The second application, which we describe in the next section, 
is more ambitious. It is concerned with training classifiers without labeled data whatsoever. 

In evaluating our framework we consider both synthetic and real-world data. In the synthetic
experiments we generate high dimensional data from two uniform distributions $X \mid \{Y = 1\}$ and
$X \mid \{Y = -1\}$ with independent dimensions and prescribed $p(Y)$ and classification accuracy. This
controlled setting allows us to examine the accuracy of the risk estimator as a function of $n$, $p(Y)$,
and the classifier accuracy.

Figure 4 shows that the relative error of $\hat R_n(\theta)$ (measured by $|\hat R_n(\theta) - R_n(\theta)| / R_n(\theta)$) in estimating
the logloss (left) and hinge loss (right) decreases with $n$, achieving accuracy of greater than
99% for $n > 1000$. In accordance with the theoretical results in Section 2.4 the figure shows that
the estimation error decreases as the classifiers become more accurate and as $p(Y)$ becomes less
uniform. We found these trends to hold in other experiments as well. In the case of exponential
loss, however, the estimator performed substantially worse (figure omitted). This is likely due to
the exponential dependency of the loss on $Y f_\theta(X)$ which makes it very sensitive to outliers.

Figure 4: The relative accuracy of $\hat R_n$ (measured by $|\hat R_n(\theta) - R_n(\theta)| / R_n(\theta)$) as a function of $n$,
classifier accuracy (acc) and the label marginal (left: logloss, right: hinge-loss). The estimation
error decreases with $n$ (approaching 1% at $n = 1000$ and decaying further). It also decreases
with the accuracy of the classifier (top) and non-uniformity of $p(Y)$ (bottom) in accordance with
the theory of Section 2.4.

Figure 5 shows the accuracy of logloss estimation for a real world transfer learning experiment
based on the 20-newsgroup data. Following the experimental setup of [3] we trained a classifier
(logistic regression) on one 20-newsgroup classification problem and tested it on a related problem.
Specifically, we used the hierarchical category structure to generate train and testing sets with
different distributions (see Figure 5 and [3] for more detail). The unsupervised estimation of the
logloss risk was very effective with relative accuracy greater than 96% and absolute error less than
0.02.



Data             $R_n$     $|\hat R_n - R_n|$   $|\hat R_n - R_n| / R_n$   $n$     $p(Y = 1)$
sci vs. comp     0.7088    0.0093               0.013                      3590    0.8257
sci vs. rec      0.641     0.0141               0.022                      3958    0.7484
talk vs. rec     0.5933    0.0159               0.026                      3476    0.7126
talk vs. comp    0.4678    0.0119               0.025                      3459    0.7161
talk vs. sci     0.5442    0.0241               0.044                      3464    0.7151
comp vs. rec     0.4851    0.0049               0.010                      4927    0.7972

Figure 5: Error in estimating logloss for logistic regression classifiers trained on one 20-newsgroup
classification task and tested on another. We followed the transfer learning setup described in [3]
which may be referred to for more detail. The train and testing sets contained samples from two
top categories in the topic hierarchy but with different subcategory proportions. The first column
indicates the top category classification task and the second indicates the empirical log-loss $R_n$
calculated using the true labels of the testing set (6). The third and fourth columns indicate the
absolute and relative errors of $\hat R_n$. The fifth and sixth columns indicate the train set size and the
label marginal distribution.

4 Application 2: Unsupervised Learning of Classifiers 

Our second application is a very ambitious one: training classifiers using unlabeled data by minimiz- 
ing the unsupervised risk estimate 6n = argmini?„(^). We evaluate the performance of the learned 
classifier On based on three quantities: (i) the unsupervised risk estimate Rn{9n), (ii) the supervised 
risk estimate Rn{On), and (iii) its classification error rate. We also compare the performance of 
On = argmin Rn{0) with that of its supervised analog argmin Rn{0). 

We compute $\hat{\theta}_n = \arg\min_\theta \hat{R}_n(\theta)$ using two algorithms (see Algorithms 1 and 2) that start with an
initial $\theta^{(0)}$ and iteratively construct a sequence of classifiers $\theta^{(1)}, \ldots, \theta^{(T)}$ which steadily decrease
$\hat{R}_n$. Algorithm 1 adopts a gradient descent-based optimization. At each iteration $t$, it approximates
the gradient vector $\nabla \hat{R}_n(\theta^{(t)})$ numerically using a finite difference approximation (25). Algorithm 2
proceeds by constructing a grid search along every dimension of $\theta^{(t)}$ and sets $[\theta^{(t)}]_i$ to the grid value
that minimizes $\hat{R}_n$. Although we focus on unsupervised training of logistic regression (minimizing the
unsupervised logloss estimate), the same techniques may be generalized to train other margin-based
classifiers such as SVM by minimizing the unsupervised hinge-loss estimate.
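
The two optimization strategies can be sketched on a generic objective as follows. This is an illustrative sketch, not the implementation used in the experiments: `risk` stands for any function mapping a parameter vector to a scalar (for instance an unsupervised risk estimate), and the step size, difference width `h`, grid width, number of grid points, and fixed iteration counts are all assumed values chosen for the example.

```python
import numpy as np

def finite_difference_gradient(risk, theta, h=1e-4):
    """Central-difference approximation of the gradient of `risk` at `theta`."""
    grad = np.zeros(theta.size)
    for i in range(theta.size):
        e_i = np.zeros(theta.size)
        e_i[i] = 1.0  # all-zero vector except for coordinate i
        grad[i] = (risk(theta + h * e_i) - risk(theta - h * e_i)) / (2.0 * h)
    return grad

def gradient_descent(risk, theta0, step=0.1, iters=200, h=1e-4):
    """Gradient descent that only ever evaluates `risk` (no analytic gradient)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iters):
        theta = theta - step * finite_difference_gradient(risk, theta, h)
    return theta

def coordinate_grid_search(risk, theta0, width=1.0, points=21, sweeps=20):
    """Cycle over coordinates; move each one to the best value on a local grid."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(sweeps):
        for i in range(theta.size):
            grid = np.linspace(theta[i] - width, theta[i] + width, points)
            values = []
            for g in grid:
                candidate = theta.copy()
                candidate[i] = g  # all other coordinates stay fixed
                values.append(risk(candidate))
            theta[i] = grid[int(np.argmin(values))]
    return theta

# Toy objective with a known minimizer at (1, -2).
target = np.array([1.0, -2.0])
toy_risk = lambda t: float(np.sum((t - target) ** 2))
print(gradient_descent(toy_risk, np.zeros(2)))
print(coordinate_grid_search(toy_risk, np.zeros(2)))
```

Both routines need only function evaluations, which is what makes them applicable when the objective is available solely through a numerical plug-in estimate.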

Figures 6 and 7 display $\hat{R}_n(\hat{\theta}_n)$, $R_n(\hat{\theta}_n)$ and error-rate($\hat{\theta}_n$) on the training and testing sets for two
real-world datasets: RCV1 (text documents) and MNIST (handwritten digit images). In
the case of RCV1 we discarded all but the most frequent 504 words (after stop-word removal) and
represented documents using their tf-idf scores. We experimented on the binary classification task of
distinguishing the top category (positive) from the next 4 top categories (negative), which resulted
in $p(Y = 1) = 0.3$ and $n = 199328$. We chose 70% of the data as an (unlabeled) training set and
held out the rest as a test set. In the case of MNIST data, we normalized each of the 28 x 28 = 784
pixels to have zero mean and unit variance. Our classification task was to distinguish images of the
digit 1 (positive) from the digit 2 (negative), resulting in 14867 samples and $p(Y = 1) = 0.53$.
We randomly chose 70% of the data as a training set and kept the rest as a testing set.
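
The per-feature normalization and the random 70/30 split can be sketched as below. This is a minimal illustration assuming NumPy and a data matrix with one example per row; the helper names, the random seed, and the handling of constant features (left at zero) are assumptions made for the example rather than details given in the text.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Scale each column (feature) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have zero spread; divide by 1 so they become all zeros.
    std = np.where(std < eps, 1.0, std)
    return (X - mean) / std

def split_train_test(X, train_fraction=0.7, seed=0):
    """Randomly assign rows of X to a training part and a held-out part."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    n_train = int(round(train_fraction * X.shape[0]))
    return X[perm[:n_train]], X[perm[n_train:]]
```

The training part is used without its labels; labels are needed only for the held-out evaluation.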

Figures 6 and 7 indicate that minimizing the unsupervised logloss estimate is quite effective in
learning an accurate classifier without labels. Both the unsupervised and supervised risk estimates
$\hat{R}_n(\hat{\theta}_n)$, $R_n(\hat{\theta}_n)$ decay nicely when computed over the train set as well as the test set. Also
interesting is the decay of the error rate. For comparison purposes, supervised logistic regression with
the same n achieved only slightly better test set error rate: 0.05 on RCV1 (instead of 0.1) and 0.07
on MNIST (instead of 0.1).

Algorithm 1 Unsupervised Gradient Descent

Input: $X^{(1)}, \ldots, X^{(n)} \in \mathbb{R}^d$, $p(Y)$, step size $\alpha$
Initialize $t = 0$, $\theta^{(t)} = \theta^0 \in \mathbb{R}^d$
repeat
    Compute $f_{\theta^{(t)}}(X^{(j)}) = \langle \theta^{(t)}, X^{(j)} \rangle$ for all $j = 1, \ldots, n$
    Estimate $(\hat{\mu}_1, \hat{\mu}_{-1}, \hat{\sigma}_1, \hat{\sigma}_{-1})$ by maximizing (12)
    for $i = 1$ to $d$ do
        Plug the estimates into the risk estimate $\hat{R}_n$ to approximate

            $\frac{\partial \hat{R}_n(\theta^{(t)})}{\partial \theta_i} \approx \frac{\hat{R}_n(\theta^{(t)} + h_i e_i) - \hat{R}_n(\theta^{(t)} - h_i e_i)}{2 h_i}$    (25)

        ($e_i$ is an all-zero vector except for $[e_i]_i = 1$)
    end for
    Form $\nabla \hat{R}_n(\theta^{(t)}) = \left( \frac{\partial \hat{R}_n(\theta^{(t)})}{\partial \theta_1}, \ldots, \frac{\partial \hat{R}_n(\theta^{(t)})}{\partial \theta_d} \right)$
    Update $\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla \hat{R}_n(\theta^{(t)})$, $t = t + 1$
until convergence
Output: linear classifier $\theta^{\text{final}} = \theta^{(t)}$

Algorithm 2 Unsupervised Grid Search

Input: $X^{(1)}, \ldots, X^{(n)} \in \mathbb{R}^d$, $p(Y)$, grid-size $\tau$
Initialize $\theta_i \sim \text{Uniform}(-2, 2)$ for all $i$
repeat
    for $i = 1$ to $d$ do
        Construct a grid of $\tau$ points in the range $[\theta_i - 4\tau, \theta_i + 4\tau]$
        Compute the risk estimate $\hat{R}_n$ where all dimensions of $\theta^{(t)}$ are fixed except for $[\theta^{(t)}]_i$, which
        is evaluated at each grid point
        Set $[\theta^{(t+1)}]_i$ to the grid value that minimized $\hat{R}_n$
    end for
until convergence
Output: linear classifier $\theta^{\text{final}} = \theta^{(t)}$

Figure 6: Performance of unsupervised logistic regression classifier $\hat{\theta}_n$ computed using Algorithm 1
(left) and Algorithm 2 (right) on the RCV1 dataset. The top two rows show the decay of the two
risk estimates $\hat{R}_n(\hat{\theta}_n)$, $R_n(\hat{\theta}_n)$ as a function of the algorithm iterations. The risk estimates of $\hat{\theta}_n$
were computed using the train set (top) and the test set (middle). The bottom row displays the
decay of the test set error rate of $\hat{\theta}_n$ as a function of the algorithm iterations. The figure shows that
the algorithm obtains a relatively accurate classifier (testing set error rate 0.1, and $\hat{R}_n$ decaying
similarly to $R_n$) without the use of a single labeled example. For comparison, the test error rate
for supervised logistic regression with the same n is 0.07.

Figure 7: Performance of unsupervised logistic regression classifier $\hat{\theta}_n$ computed using Algorithm 1
(left) and Algorithm 2 (right) on the MNIST dataset. The top two rows show the decay of the two
risk estimates $\hat{R}_n(\hat{\theta}_n)$, $R_n(\hat{\theta}_n)$ as a function of the algorithm iterations. The risk estimates of $\hat{\theta}_n$
were computed using the train set (top) and the test set (middle). The bottom row displays the
decay of the test set error rate of $\hat{\theta}_n$ as a function of the algorithm iterations. The figure shows that
the algorithm obtains a relatively accurate classifier (testing set error rate 0.1, and $\hat{R}_n$ decaying
similarly to $R_n$) without the use of a single labeled example. For comparison, the test error rate
for supervised logistic regression with the same n is 0.05.

Figure 8: Performance of unsupervised classifier training on RCV1 data (top class vs. classes 2-5) for
misspecified $p(Y)$. The performance of the estimated classifier (in terms of training set empirical
logloss $R_n$ (6) and test error rate measured using held-out labels) decreases with the deviation
between the assumed and true $p(Y = 1)$ (true $p(Y = 1) = 0.3$). The classifier performance is very
good when the assumed $p(Y)$ is close to the truth and degrades gracefully when the assumed $p(Y)$
is not too far from the truth.

4.1 Inaccurate Specification of $p(Y)$

Our estimation framework assumes that the marginal $p(Y)$ is known. In some cases we may only
have an inaccurate estimate of $p(Y)$. It is instructive to consider how the performance of the learned
classifier degrades with the inaccuracy of the assumed $p(Y)$.

Figure 8 displays the performance of the learned classifier for RCV1 data as a function of the
assumed value of $p(Y = 1)$ (the correct value is $p(Y = 1) = 0.3$). We conclude that knowledge of $p(Y)$
is an important component in our framework but precise knowledge is not crucial. Small deviations
of the assumed $p(Y)$ from the true $p(Y)$ result in a small degradation of logloss estimation quality
and testing set error rate. Naturally, a large deviation of the assumed $p(Y)$ from the true $p(Y)$
renders the framework ineffective.
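
The effect of the assumed $p(Y = 1)$ on the logloss estimate can be explored on synthetic margins with a sketch like the one below. It is not the experimental code: it assumes the margins within each class are Gaussian, fits a two-component mixture by EM with the mixing weights held at the assumed $p(Y)$, initializes the positive component at the larger margins, and evaluates the expected logloss by Gauss-Hermite quadrature. All function names, the iteration count, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_margin_mixture(f, p_pos, iters=200):
    """EM for a 1-D two-Gaussian mixture whose weights are held fixed at
    (p_pos, 1 - p_pos). Returns (mu, sd); index 0 is the positive class."""
    w = np.array([p_pos, 1.0 - p_pos])
    # Assumption: positives have the larger margins, so start each mean at the
    # median of the upper p_pos share / lower (1 - p_pos) share of the data.
    mu = np.array([np.quantile(f, 1.0 - p_pos / 2.0),
                   np.quantile(f, (1.0 - p_pos) / 2.0)])
    sd = np.full(2, f.std())
    for _ in range(iters):
        # E-step: posterior probability of each component for every margin.
        log_post = np.log(w) - np.log(sd) - 0.5 * ((f[:, None] - mu) / sd) ** 2
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update means and standard deviations only (weights stay fixed).
        mass = resp.sum(axis=0)
        mu = (resp * f[:, None]).sum(axis=0) / mass
        sd = np.sqrt((resp * (f[:, None] - mu) ** 2).sum(axis=0) / mass)
        sd = np.maximum(sd, 1e-6)
    return mu, sd

def plugin_logloss(f, p_pos, quad_points=64):
    """Plug-in logloss estimate from unlabeled margins f and an assumed p(Y=1)."""
    mu, sd = fit_margin_mixture(f, p_pos)
    # Gauss-Hermite rule, normalized to give expectations under N(0, 1).
    x, wts = np.polynomial.hermite_e.hermegauss(quad_points)
    wts = wts / wts.sum()
    loss_pos = np.sum(wts * np.logaddexp(0.0, -(mu[0] + sd[0] * x)))  # y = +1
    loss_neg = np.sum(wts * np.logaddexp(0.0, mu[1] + sd[1] * x))     # y = -1
    return float(p_pos * loss_pos + (1.0 - p_pos) * loss_neg)

# Synthetic margins from a known mixture with true p(Y = 1) = 0.3.
rng = np.random.default_rng(0)
y = np.where(rng.random(20000) < 0.3, 1, -1)
f = rng.normal(loc=2.0 * y, scale=1.0)
labeled = float(np.mean(np.logaddexp(0.0, -y * f)))
for assumed in (0.2, 0.3, 0.4):
    print(assumed, plugin_logloss(f, assumed), labeled)
```

Comparing the printed estimates against the labeled logloss for several assumed values gives a small-scale analog of the sweep summarized in Figure 8.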

5 Related Work 

Related problems have been addressed in [7] and [9]. The work in [7] performs transduction by
enforcing constraints on the label proportions. However, their method requires labeled data. The
work in [9] aims to estimate the labels of an unlabeled testing set using known label proportions of
n sets of unlabeled observations. The key difference between their approach and ours is that they
require as many splits of the data as the number of classes and therefore require knowledge
of the label proportions in each split. This is a much stronger assumption than knowing $p(Y)$. As
noted previously (see the comment after Proposition 5), our analysis is in fact valid when only the
order of label proportions is known, rather than the absolute values.

An important distinction between our work and the references above is that our work provides
an estimate for the margin-based risk and therefore leads naturally to unsupervised versions of
logistic regression and support vector machines. We also provide asymptotic analysis showing
convergence of the resulting classifier to the optimal classifier (the minimizer of (2)). Experimental
results show that in practice the accuracy of the unsupervised classifier is of the same order as
(though naturally slightly lower than) that of its supervised analog.

6 Discussion 

In this paper we developed a novel framework for estimating margin-based risks using only unlabeled
data. We showed that it performs well in practice on several different datasets. We derived a
theoretical basis by casting it as a maximum likelihood problem for a Gaussian mixture model followed
by plug-in estimation.
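
The two steps named above can be written out schematically as follows. This is a sketch under the assumption that $f_\theta(X)$ given $Y = y$ is Gaussian with mean $\mu_y$ and variance $\sigma_y^2$; the symbol $\phi(\cdot\,; \mu, \sigma^2)$ for the Gaussian density is introduced here for illustration and may differ from the notation of the formal statements.

```latex
% Step 1: maximum likelihood for a two-component mixture with known weights p(y)
\[
(\hat\mu, \hat\sigma) \;=\; \arg\max_{\mu, \sigma} \;
\sum_{j=1}^{n} \log \sum_{y \in \{-1, +1\}} p(y)\,
\phi\!\left(f_\theta(X^{(j)});\, \mu_y, \sigma_y^2\right)
\]

% Step 2: plug the fitted class-conditional densities into the risk (2)
\[
\hat R_n(\theta) \;=\; \sum_{y \in \{-1, +1\}} p(y)
\int_{-\infty}^{\infty} L(y, z)\, \phi\!\left(z;\, \hat\mu_y, \hat\sigma_y^2\right) dz
\]
```

The second expression is the risk (2) decomposed over the label, with the unknown distribution of the margin within each class replaced by its fitted Gaussian.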

Remarkably, the theory states that, assuming normality of $f_\theta(X)$ and a known $p(Y)$, we are able
to estimate the risk $R(\theta)$ without a single labeled example. That is, the risk estimate converges
to the true risk as the number of unlabeled examples increases. Moreover, using uniform convergence
arguments it is possible to show that the proposed training algorithm converges to the optimal
classifier as $n \to \infty$ without any labeled data.

On a more philosophical level, our approach points at novel questions that go beyond supervised 
and semi-supervised learning. What benefit do labels provide over unsupervised training? Can 
our framework be extended to semi-supervised learning where a few labels do exist? Can it be 
extended to non-classification scenarios such as margin-based regression or margin-based structured
prediction? When are the assumptions likely to hold and how can we make our framework even 
more resistant to deviations from them? These questions and others form new and exciting open 
research directions. 